
npj Digital Medicine

Springer Science and Business Media LLC

All preprints, ranked by how well they match npj Digital Medicine's content profile, based on 97 papers previously published here. The average preprint has a 0.21% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
ChatCLIDS: Simulating Persuasive AI Dialogues to Promote Closed-Loop Insulin Adoption in Type 1 Diabetes Care

Yao, Z.; Chafekar, T.; Wang, J.; Han, S.; Ouyang, F.; Qian, J.; Li, L.; Yu, H.

2025-09-04 medical education 10.1101/2025.09.02.25334973 medRxiv
Top 0.1%
87.3%

Real-world adoption of closed-loop insulin delivery systems (CLIDS) in type 1 diabetes remains low, driven not by technical failure, but by diverse behavioral, psychosocial, and social barriers. We introduce ChatCLIDS, the first benchmark to rigorously evaluate LLM-driven persuasive dialogue for health behavior change. Our framework features a library of expert-validated virtual patients, each with clinically grounded, heterogeneous profiles and realistic adoption barriers, and simulates multi-turn interactions with nurse agents equipped with a diverse set of evidence-based persuasive strategies. ChatCLIDS uniquely supports longitudinal counseling and adversarial social influence scenarios, enabling robust, multi-dimensional evaluation. Our findings reveal that while larger and more reflective LLMs adapt strategies over time, all models struggle to overcome resistance, especially under realistic social pressure. These results highlight critical limitations of current LLMs for behavior change, and offer a high-fidelity, scalable testbed for advancing trustworthy persuasive AI in healthcare and beyond.

2
Remote Perioperative Symptom Monitoring via Smartphone: A Large-Scale Feasibility Analysis

Frumkin, M.; Messner, G.; Holzer, K.; Xu, Z.; Rodebaugh, T.; Bernstein, H.; Frey, K.; Ahuja, S.; Lu, C.; Haroutounian, S.

2025-07-28 surgery 10.1101/2025.07.27.25332242 medRxiv
Top 0.1%
77.5%

Ecological momentary assessment (EMA) holds promise for perioperative monitoring, yet large-scale feasibility data are lacking. The Personalized Prediction of Persistent Postsurgical Pain (P5) study enrolled 2,500 adults undergoing major surgery at a single center. EMA consisting of 15 items was administered three times daily via smartphone. Participants were neither directly incentivized for EMA compliance nor excluded for non-compliance. Approximately 90% of participants completed at least some EMA. Average preoperative compliance was 66% (Median=79%) and average postoperative compliance was 60% (Median=71%) in the first 30 days after surgery. Postoperative compliance differed by surgical site, with lowest compliance among vascular and cardiothoracic patients. Demographic characteristics, including race, insurance status, and education, were associated with compliance. Overall, perioperative EMA appears feasible. Appropriate handling of missing data is critical to ensure models are generalizable to individuals who hold marginalized identities.

3
Hippocrates-o1: A Guideline-Aware, Orchestrated, Self-Refining Protocol for Specialty-Specific Clinical Reasoning

Wang, B.; Schaefer, E.; Aguirre, N.; Huang, J.; Li, X.; Kolamuri, S. R.; Ying, L.; Li, X.; Chao, G.; Ang, S. S.; Vallabhajosyula, P.; Krumholz, H.; Gibbs, K. E.; Pai, S. I.; Schneider, E. B.; Cohan, A.; Ong, C. S.

2025-12-05 surgery 10.64898/2025.12.04.25341678 medRxiv
Top 0.1%
75.8%

Background: Clinical decision support requires language models that provide guideline-aligned, context-aware reasoning with clear justification. Many existing benchmarks emphasize multiple-choice or short-form question answering and mainly capture factual recall rather than longitudinal clinical reasoning from extended clinical notes. Hippocrates-o1 is a family of domain-tailored clinical reasoning pipelines that combine structured prompts, guideline-informed retrieval, and iterative self-refinement across oncology, general surgery, and vascular surgery. Methods: Real-world head and neck cancer cases were drawn from the MIMIC-IV-Note database, with a subset (n=20) randomly selected for detailed annotation. Six physicians adjudicated treatment phase and intent using structured criteria and rated model outputs. For each case, we generated outputs using both a general-purpose baseline model (VanillaLLM) and our oncology-specific reasoning model, Hippocrates-Karkinos-o1. Experts evaluated the outputs across five dimensions on a scale of 1 to 5: Clinical Knowledge Application, Contextual Understanding, Reasoning Transparency, Chain-of-Thought Quality, and Hallucination Audit. Overall Reasoning was the mean of domain scores. To explore whether the approach could extend beyond oncology, we also processed inguinal hernia and aortic aneurysm cases through the Hippocrates-Chirurgos-o1 and Hippocrates-Angios-o1 domain adaptations. Results: Across paired ratings, Hippocrates-Karkinos-o1 improved Overall Reasoning from 3.40 ± 0.90 to 4.00 ± 0.73 (p < 0.001). Domain scores increased for Clinical Knowledge Application (2.87 ± 1.20 to 3.70 ± 1.03), Contextual Understanding (3.48 ± 0.95 to 3.98 ± 0.95), Hallucination Audit (3.90 ± 1.32 to 4.74 ± 0.76), Reasoning Transparency (3.45 ± 1.02 to 3.86 ± 0.87), and Chain-of-Thought Quality (3.32 ± 1.04 to 3.69 ± 1.00), all p ≤ 0.001. Surgical and vascular adaptations showed parallel qualitative improvements.
Conclusions: The Hippocrates-o1 protocol improved reasoning fidelity, guideline alignment, and factual grounding relative to a general-purpose model and generalized across oncology, surgery, and vascular care. Orchestrated retrieval and self-refinement provide a reproducible template for evaluating and enhancing clinical reasoning in medical AI.

4
A Virtual Patients Ensemble Approach for Predicting Surgical Complications

Neuman, Y.; Cohen, Y.; Neuman, Y.

2025-09-22 surgery 10.1101/2025.09.21.25336262 medRxiv
Top 0.1%
75.0%

AI has shown promise in predicting surgical complications, but most existing models estimate overall risk levels rather than identifying the specific complications an individual patient may develop. We present an AI agent that uses a Virtual Patients Ensemble (VPE) approach to generate individualized predictions of surgical complications from unstructured case descriptions. The agent applies structured reasoning to extract diagnoses, surgical procedures, and risk factors from clinical narratives. From this profile, it generates a cohort of N virtual patients, each a plausible variation of the original case. This ensemble captures uncertainty in patient-specific risk factors and grounds LLM-based clinical reasoning in individualized clinical scenarios. For each virtual patient, the agent predicts the most likely complications, and a final distribution is presented over the virtual patients. The agent was evaluated on 1440 case reports from the PMC-Patients dataset, of which 186 met the inclusion criteria. Predictive performance was compared with both null-hypothesis expectations and baseline LLM predictions. The agent correctly identified 32% of the observed complications, significantly outperforming the null-hypothesis baseline and a baseline prediction generated by the LLM. Unlike risk calculators or machine-learning models trained on population averages, this approach derives predictions directly from a patient's clinical profile, generating a VPE to predict specific complications rather than general risk levels. The results suggest that ensemble-based, patient-centered simulation can support clinical decision-making by offering interpretable, individualized predictions. Prospective validation is required before integration into practice. We thus provide surgeons with an app for experimenting with the agent and providing feedback for improvement.
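The ensemble-aggregation step this abstract describes can be sketched in a few lines; the four-patient ensemble and the complication labels below are illustrative assumptions, not the paper's data:

```python
from collections import Counter

# Hypothetical predictions: one set of complications per virtual patient,
# each virtual patient being a plausible variation of the original case.
virtual_patient_predictions = [
    {"surgical site infection", "ileus"},
    {"surgical site infection"},
    {"surgical site infection", "anastomotic leak"},
    {"ileus"},
]

# Aggregate into a frequency distribution over the ensemble: the share of
# virtual patients for whom each complication was predicted.
counts = Counter(c for preds in virtual_patient_predictions for c in preds)
n = len(virtual_patient_predictions)
distribution = {c: k / n for c, k in counts.most_common()}
print(distribution)
```

Because predictions are multi-label, the shares need not sum to 1; the output is read as "this complication appeared in X% of plausible case variations."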

5
Decentralized, privacy-preserving surgical video analysis with Swarm Learning

Saldanha, O. L.; Pfeiffer, K.; Bodenstedt, S.; Kirchner, M.; Jenke, A. C.; Barata, C.; Barbosa, S.; Barthel, J.; Carstens, M.; Castro, L. T.; Dehlke, K.; Dietz, S.; Emmanouilidis, S.; Fitze, G.; Freitag, M.; Holderried, F.; Kanjo, W.; Leitermann, L.; Mees, S. T.; Soares, A. S.; Pascoal, M.; Pistorius, S.; Prudlo, C.; Schultz, J.; Seiberth, A.; Thiel, K.; Wu, X.; Ziehn, D.; Speidel, S.; Weitz, J.; Distler, M.; Kather, J. N.; Kolbinger, F. R.

2025-10-03 surgery 10.1101/2025.10.02.25337106 medRxiv
Top 0.1%
70.3%

Background: Progress in artificial intelligence-based analysis of surgical videos has been constrained by reliance on manual frame-level annotations rather than patient-level outcomes. In addition, concerns about data privacy restrict the exchange of laparoscopic video data and, thereby, multicenter collaboration. Methods: To address these limitations, we developed a pipeline that integrates weakly supervised deep learning with Swarm Learning, a decentralized machine learning approach that enables collaborative model training without data centralization. We evaluate our pipeline using a newly curated dataset of 397 laparoscopic appendectomy recordings from six international surgical centers. We identified optimal modelling configurations (frame sampling rates and model architectures) and subsequently compared Swarm Learning to single-center and centralized learning across three novel patient-level disease staging tasks: (i) binary detection of perforated appendicitis, (ii) laparoscopic grading of appendicitis, and (iii) histopathologic inflammation grading. In addition, we surveyed participating centers to identify real-world barriers to the clinical implementation of our decentralized learning pipeline for surgical video analysis. Results: For appendicitis grading tasks, frame sampling at 1.0 frames per second and use of the SurgTempoNet architecture resulted in reliable classification performance, outperforming SurgFrameNet and Multiple Instance Learning. Across all three disease staging tasks, Swarm Learning consistently outperformed single-center training and achieved performance comparable to centralized learning, with stable generalization in external validation. The user survey identified hardware failure and limited integration with electronic patient records as key barriers to clinical implementation of the pipeline for collaborative surgical video analysis.
Conclusions: Weakly supervised deep learning enables the prediction of patient-level endpoints directly from surgical video data. Swarm Learning facilitates privacy-preserving multicenter collaboration and achieves performance on par with centralized learning, highlighting its potential for advancing clinically relevant, collaborative AI development in surgical video analysis, especially when integrated with patients' electronic health records. Article Description: This study introduces a decentralized, privacy-preserving pipeline that combines weakly supervised deep learning with Swarm Learning to predict patient-level outcomes from laparoscopic appendectomy videos. Using data from six international surgical centers, the approach demonstrated performance comparable to centralized learning across three disease staging tasks while preserving data confidentiality by design.

6
Patient-Centric Markov-Chain Framework for Predicting Medication Adherence Using De-Identified Data

Dantuluri, A. V. S. R.

2026-02-10 health informatics 10.64898/2026.02.08.26345856 medRxiv
Top 0.1%
66.5%

Long-term adherence to prescribed therapies remains a persistent challenge in chronic and ultra-rare conditions where clinical outcomes depend on continuous medication use. Even brief gaps in therapy can compromise disease control, yet patients frequently encounter structural barriers including high out-of-pocket costs, prior-authorization (PA) delays, annual re-verification cycles, and refill logistics that disrupt persistence. This study evaluates a patient-centric Markov-chain framework for adherence risk stratification trained on eight years of de-identified specialty-pharmacy data representing 1,200 active patients. Certified data aggregators supply longitudinally linkable, tokenized data to preserve privacy while enabling multi-year adherence trajectory modeling. Transition probabilities between fully adherent, partially adherent, and lapsed states are estimated and adjusted using covariates such as age, duration on therapy, refill cadence, PA processing time, copay burden, and foundation-assistance status. The model achieves an accuracy of 0.82, an F1-score of 0.79, and an AUC of 0.87, with 95% confidence intervals estimated via bootstrapping across cross-validated folds. Results highlight cost exposure, administrative friction, and mid-treatment duration (1-5 years) as dominant predictors of future non-adherence. Findings demonstrate how probabilistic modeling of privacy-preserved real-world data can support equitable patient-assistance strategies, identifying individuals vulnerable to systemic barriers rather than emphasizing commercial performance metrics.
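The three-state chain this abstract describes can be sketched as follows; the states come from the abstract, but the transition probabilities are illustrative stand-ins, not the paper's fitted values:

```python
# Hypothetical 3-state adherence chain: state 0 = fully adherent,
# 1 = partially adherent, 2 = lapsed. Rows of P are transition
# probabilities out of each state (illustrative values only).
P = [
    [0.85, 0.10, 0.05],
    [0.30, 0.50, 0.20],
    [0.10, 0.25, 0.65],
]

def step(dist, P):
    """One Markov step: new_dist[j] = sum_i dist[i] * P[i][j]."""
    return [sum(dist[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]

def lapse_risk(start_state, months):
    """Probability of occupying the lapsed state after `months` steps."""
    dist = [0.0, 0.0, 0.0]
    dist[start_state] = 1.0
    for _ in range(months):
        dist = step(dist, P)
    return dist[2]

# Lapse risk over 6 months for a currently fully adherent patient.
print(round(lapse_risk(0, 6), 3))
```

In the paper's framework the rows of `P` would additionally be adjusted per patient by covariates such as copay burden and PA processing time.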

7
High-Fidelity Synthetic Data Replicates Clinical Prediction Performance in a Million-Patient Diabetes Cohort

de la Oliva-Roque, V. M.; Kreil, D. P.; Dopazo, J.; Ortuno, F.; Loucera, C.

2025-07-21 health informatics 10.1101/2025.07.20.25331852 medRxiv
Top 0.1%
66.5%

Synthetic data generated using generative models trained on real clinical data offers a promising solution to privacy concerns in health research. However, many efforts are limited by small or demographically narrow training datasets, reducing the generalizability of the synthetic data. To address this, we used real-world clinical data from nearly one million individuals with diabetes in the Andalusian Population Health Database (BPS) to generate a comprehensive longitudinal synthetic dataset. We employed a dual adversarial autoencoder to produce synthetic data and evaluated its utility in a clinical machine learning (ML) task: predicting the onset of chronic kidney disease, a common diabetes complication. Models trained on synthetic data were assessed for their ability to reproduce patterns and predictive behaviors observed in real data. Performance and stability were compared across models trained on real, synthetic, and hybrid datasets. Models trained exclusively on synthetic data achieved AUROC scores comparable to real-data models (0.70 vs. 0.73) and showed high stability in feature importance rankings (weighted Kendall's τ > 0.9). Notably, combining synthetic and real data did not improve performance. Our findings demonstrate that high-fidelity synthetic longitudinal data can replicate real data performance in clinical ML, supporting its use in research while preserving patient privacy. This represents a significant step toward more collaborative and privacy-preserving healthcare data ecosystems.
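The ranking-stability check can be illustrated with a plain (unweighted) Kendall's τ over feature-importance scores; the paper uses a weighted variant that upweights top-ranked features, and the importance values below are invented for illustration:

```python
from itertools import combinations

def kendall_tau(a, b):
    """Unweighted Kendall's tau between two equal-length score vectors."""
    n = len(a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Feature-importance scores from a real-data model and a synthetic-data
# model (illustrative): identical ordering yields tau = 1.0.
real      = [0.41, 0.22, 0.15, 0.12, 0.06, 0.04]
synthetic = [0.38, 0.25, 0.14, 0.11, 0.07, 0.05]
print(kendall_tau(real, synthetic))  # → 1.0
```

A weighted version (e.g. `scipy.stats.weightedtau`) gives disagreements among the most important features more influence on the score, which is what a τ > 0.9 stability claim typically targets.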

8
Population differences in wearable device wear time: Rescuing data to address biases and advance health equity

Hurwitz, E.; Connelly, E.; Sklerov, M.; Master, H.; Hochheiser, H.; Butzin-Dozier, Z.; Dunn, J.; Haendel, M. A.

2026-03-06 health informatics 10.64898/2026.03.06.26347799 medRxiv
Top 0.1%
65.9%

Wearable devices present transformative opportunities for personalized healthcare through continuous monitoring of digital biomarkers; however, individual variations in device wear time could mask or otherwise impact signal identification. Despite the widespread adoption of wearable devices in research, no comprehensive framework exists for understanding how wear time varies across populations or for addressing wear time-related biases in analysis. Using Fitbit data from 11,901 participants in the All of Us Research Program, we conducted the first large-scale systematic assessment of wearable device wear time across demographics, social determinants of health, lifestyle factors, mental health symptoms, and disease. Our findings revealed that wear time was higher among males and increased with age, income, and education, but decreased with depressive, anxiety, and anhedonia symptoms, with reductions more pronounced following clinical diagnoses compared to symptom-based classifications. Individuals with chronic conditions displayed differential levels of wear time compared to healthy controls. Critically, we demonstrate that the widely used ≥10-hour daily compliance threshold, while appropriate for some research contexts, can disproportionately exclude days of data from disease populations: among individuals with major depressive disorder, 74.4% of data days were excluded compared to 20.9% for controls. We propose a flexible methodological framework including standard compliance thresholds, wear time covariate adjustment, metric normalization, propensity score matching, and adaptive thresholds that can be applied individually or in combination to optimize wearable data retention across diverse research contexts. These findings establish wear time as a critical methodological consideration for wearable device research and provide guidance for advancing equitable and rigorous digital health analytics.
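The fixed-versus-adaptive threshold comparison at the heart of this argument reads directly as code; the wear-hour values and the relaxed 8-hour cutoff below are illustrative, not from the study:

```python
# One participant's daily wear hours over a week (illustrative data).
daily_wear_hours = [14.0, 9.5, 3.0, 11.2, 8.0, 12.5, 0.0]

def retained_days(hours, threshold=10.0):
    """Keep only days meeting the compliance threshold (default >=10 h)."""
    return [h for h in hours if h >= threshold]

fixed = retained_days(daily_wear_hours)          # standard >=10 h rule
relaxed = retained_days(daily_wear_hours, 8.0)   # adaptive, lower cutoff
print(len(fixed), len(relaxed))
```

Lowering the cutoff retains more days; the paper's point is that a fixed ≥10 h rule discards data unevenly across clinical populations, so the threshold itself is an analytic choice to justify.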

9
Bayesian Sequential Modeling of Time-to-Urination for Dynamic ED Triage

Senda, A.; Takatsu, Y.; Ikebe, R.; Suginaka, H.; Morishita, K.; Endo, A.

2025-10-19 health informatics 10.1101/2025.10.16.25338202 medRxiv
Top 0.1%
65.8%

Triage tools in routine emergency care are largely static, failing to exploit simple behavioral cues clinicians notice in real time. Here, we developed a Bayesian, sequentially updating framework that integrates incoming cues to produce calibrated, time-consistent risk. Using a prospective single-center cohort of ambulance arrivals in Japan (February-August 2025; n=2,221), we evaluated time to first urination (TTU) as a proof-of-concept bedside cue for predicting hospital admission. Population-level fit to the cumulative admission curve was excellent (integrated squared error 0.002; RMSE 0.003; Kolmogorov-Smirnov 0.008; coverage 0.98). At the patient level, performance improved markedly with age/sex adjustment (AUC(t) 0.70 vs. 0.50 unadjusted), with lower Brier scores and positive calibration slopes. Platt recalibration refined probability scaling without altering discrimination, and decision-curve analysis showed small, favorable net benefit at common thresholds. This framework is readily extensible to multimodal inputs and external validation and is designed to complement, not replace, existing triage systems.
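The sequential-updating idea can be illustrated with a simple odds-scale Bayes update applied cue by cue; the baseline admission rate, the example cues, and their likelihood ratios are invented for illustration, not the study's fitted quantities:

```python
def update(prior: float, likelihood_ratio: float) -> float:
    """Bayes update on the odds scale: posterior odds = LR * prior odds."""
    odds = prior / (1.0 - prior) * likelihood_ratio
    return odds / (1.0 + odds)

risk = 0.30  # assumed baseline admission probability on arrival
# Hypothetical bedside cues with illustrative likelihood ratios,
# e.g. "no urination by 2 h" and "age > 75".
for lr in [2.5, 1.8]:
    risk = update(risk, lr)
print(round(risk, 3))
```

Each arriving cue multiplies the current odds by its likelihood ratio, so the risk estimate stays calibrated and time-consistent as evidence accumulates, which is the property the framework is built around.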

10
Developing better digital health measures of Parkinson's disease using free living data and a crowdsourced data analysis challenge

Sieberts, S.; Borzymowski, H.; Guan, Y.; Huang, Y.; Matzner, A.; Page, A.; Bar-Gad, I.; Beaulieu-Jones, B.; El-Hanani, Y.; Goschenhofer, J.; Javidnia, M.; Keller, M. S.; Li, Y.-c.; Venuto, C.; Saqib, M.; Smith, G.; Stanescu, A.; Zielinski, R.; the BEAT-PD DREAM Challenge Consortium, ; Jayaraman, A.; Evers, L. J. W.; Foschini, L.; Mariakakis, A.; Pandey, G.; Shawen, N.; Snyder, P.; Omberg, L.

2021-10-23 neurology 10.1101/2021.10.20.21265298 medRxiv
Top 0.1%
65.7%

One of the promising opportunities of digital health is its potential to lead to more holistic understandings of diseases by interacting with the daily life of patients and through the collection of large amounts of real world data. Validating and benchmarking indicators of disease severity in the home setting is difficult, however, given the large number of confounders present in the real world and the challenges in collecting ground truth data in the home. Here we leverage two datasets with continuous wrist-worn accelerometer data coupled with frequent symptom reports in the home setting, to develop digital biomarkers of symptom severity. Using these data, we performed a public benchmarking challenge in which participants were asked to build measures of severity across 3 symptoms (on/off medication, dyskinesia, and tremor). 42 teams participated, and performance improved over baseline models for each subchallenge. Additional ensemble modeling across submissions further improved performance, and the top models were validated in a subset of patients whose symptoms were observed and rated by trained clinicians.

11
Large Language Models forecast Patient Health Trajectories enabling Digital Twins

Makarov, N.; Bordukova, M.; Rodriguez-Esteban, R.; Schmich, F.; Menden, M. P.

2024-08-16 health informatics 10.1101/2024.07.05.24309957 medRxiv
Top 0.1%
65.2%

Background: Generative artificial intelligence (AI) accelerates the development of digital twins, which enable virtual representations of real patients to explore, predict and simulate patient health trajectories, ultimately aiding treatment selection and clinical trial design. Recent advances in forecasting utilizing generative AI, in particular large language models (LLMs), highlight untapped potential to overcome real-world data (RWD) challenges such as missingness, noise and limited sample sizes, thus empowering the next generation of AI algorithms in healthcare. Methods: We developed the Digital Twin - Generative Pretrained Transformer (DT-GPT) model, which utilizes biomedical LLMs using rich electronic health record (EHR) data. Our method eliminates the need for data imputation and normalization, enables forecasting of clinical variables, and offers preliminary explainability through a human-interpretable interface. We benchmarked DT-GPT on RWD including long-term US nationwide non-small cell lung cancer (NSCLC) and short-term Intensive Care Unit (ICU) datasets. Findings: DT-GPT surpassed state-of-the-art machine learning methods in patient trajectory forecasting on mean absolute error (MAE) for both the long-term (3.4% MAE improvement) and the short-term (1.3% MAE improvement) dataset. Additionally, DT-GPT was capable of preserving cross-correlations of clinical variables (average R2 of 0.98) and handling data missingness and noise. Finally, we discovered the ability of DT-GPT to provide insights into a forecast's rationale and to perform zero-shot forecasting on variables not used during fine-tuning, outperforming even fully trained task-specific machine learning models on 13 clinical variables. Interpretation: DT-GPT demonstrates that LLMs can serve as a robust medical forecasting platform, empowering digital twins which virtually replicate patient characteristics beyond their training data.
We envision that LLM-based digital twins will enable a variety of use cases, including clinical trial simulations, treatment selection and adverse event mitigation.

12
AI-Driven Zero-Touch Network Orchestration for Tele-Radiology in Resource-Constrained Environments

Javed, M. Z.; Majeed, R.; Shafeeq, U.; Usman, H.; Ahmad, M.

2026-02-16 medical education 10.64898/2026.02.13.26346260 medRxiv
Top 0.1%
64.0%

Background: The deployment of high-fidelity diagnostic Artificial Intelligence (AI) in resource-constrained environments is hindered by the stochastic nature of network latency and bandwidth limitations. Traditional tele-radiology relies on static cloud offloading, which introduces unacceptable latency for critical care scenarios. Zero-Touch Network and Service Management (ZSM) offers a paradigm for automated network orchestration, yet current frameworks lack application-layer awareness regarding clinical urgency and image complexity. Methodology: This study proposes a novel Cross-Modal Latent Transformer (CMLT) integrated within a Zero-Touch Network Orchestration architecture. The system utilizes a lightweight Edge-Gating mechanism to dynamically partition inference tasks between edge nodes and cloud resources based on feature entropy. The model was trained and validated on the MIMIC-CXR (v2.0.0) (n = 377,110) and CheXpert (n = 224,316) datasets, employing a 70/10/20 split. Results: The proposed orchestration framework achieved an AUC-ROC of 0.962 [95% CI: 0.941-0.983] for Atelectasis detection, comparable to full-cloud inference, while reducing network bandwidth consumption by 64.3%. McNemar's test indicated no statistically significant difference in diagnostic accuracy between the orchestrated hybrid approach and the full-precision cloud baseline (p > 0.05), despite a 120 ms reduction in mean inference latency. Clinical Significance: By embedding clinical feature extraction directly into the network orchestration logic, this framework enables real-time, zero-touch provisioning of diagnostic resources, facilitating reliable AI deployment in rural and bandwidth-limited clinical settings.

13
Evaluating the Influence of Demographic Identity in the Medical Use of Large Language Models

Lee, S.; Cho, W. I.; Lee, Y.; Park, C.; Park, C.; Ko, T.

2025-07-11 medical ethics 10.1101/2025.07.09.25331072 medRxiv
Top 0.1%
63.7%

As large language models (LLMs) are increasingly adopted in medical decision-making, concerns about demographic biases in AI-generated recommendations remain unaddressed. In this study, we systematically investigate how demographic attributes, specifically race and gender, affect the diagnostic, medication, and treatment decisions of LLMs. Using the MedQA dataset, we construct a controlled evaluation framework comprising 20,000 test cases with systematically varied doctor-patient demographic pairings. We evaluate two LLMs of different scales: Claude 3.5 Sonnet, a high-performance proprietary model, and Llama 3.1-8B, a smaller open-source alternative. Our analysis reveals significant disparities in both accuracy and bias patterns across models and tasks. While Claude 3.5 Sonnet demonstrates higher overall accuracy and more stable predictions, Llama 3.1-8B exhibits greater sensitivity to demographic attributes, particularly in diagnostic reasoning. Notably, we observe the largest accuracy drop when Hispanic patients are treated by White male doctors, underscoring potential risks of bias amplification. These findings highlight the need for rigorous fairness assessments in medical AI and inform strategies to mitigate demographic biases in LLM-driven healthcare applications.

14
Developing and Testing an Engineering Framework for Curiosity-Driven and Humble AI in Clinical Decision Support

Arslan, J.; Benke, K.; Cajas, S.; Castro, R.; Celi, L. A.; Cruz Suarez, G. A.; Delos Reyes, R.; Engelmann, J.; Ercole, A.; Hilel, A.; Kalla, M.; Kinyera, L.; Lange, M.; Lunde, T. M.; Meni, M. J.; Ocampo Osorio, F.; Premo, A.; Sedlakova, J.; Vig, P.

2026-02-07 health informatics 10.64898/2026.02.06.26345664 medRxiv
Top 0.1%
63.7%

Background: We present BODHI (Balanced, Open-minded, Diagnostic, Humble, and Inquisitive), an engineering framework for curiosity-driven and humble clinical decision support AI. Despite growing capabilities, large language models (LLMs) often express inappropriate confidence, conflating statistical pattern recognition with genuine medical understanding. BODHI addresses this through a dual-reflective architecture that: (1) decomposes epistemic uncertainty into task-specific dimensions, and (2) constrains model responses using virtue-based stance rules derived from a Virtue Activation Matrix. Methods: We validate the framework through controlled evaluation on 200 clinical vignettes from HealthBench Hard, assessing GPT-4o-mini and GPT-4.1-mini across 5 random seeds (1,800 total observations). Statistical analysis included bootstrap resampling, paired t-tests, and effect size computation (Supplementary Materials S3). Findings: BODHI significantly improved overall clinical response quality (GPT-4.1-mini: +17.3pp, p < 0.0001, Cohen's d = 0.50; GPT-4o-mini: +7.4pp, p < 0.0001, Cohen's d = 0.22) while achieving very large effect sizes on curiosity (context-seeking rate: Cohen's d = 16.38 and 19.54) and humility (hedging: d = 5.80 for GPT-4.1-mini) metrics. Crucially, 97.3% of GPT-4.1-mini responses and 73.5% of GPT-4o-mini responses included appropriate clarifying questions, compared to 7.8% and 0.0% at baseline, demonstrating the framework's effectiveness in eliciting information-gathering behavior. Interpretation: These findings suggest LLMs can be reliably constrained to operate within epistemic boundaries when provided with structured uncertainty decomposition and virtue-aligned response rules, offering a pathway toward safer clinical AI deployment.

15
Surgical Information Assistant: A technical report on an agentic information retrieval System for surgical information

Bhattacharyya, K.

2025-05-21 surgery 10.1101/2025.05.20.25328046 medRxiv
Top 0.1%
63.6%

We present the Surgical Information Assistant, an agentic retrieval-augmented generation (RAG) system designed to improve access to surgical knowledge in resource-constrained settings. Built on the Open Manual of Surgery for Resource-Limited Settings, the assistant uses a retrieval method we call DeRetSyn (Decompose-Retrieve-Synthesize). We evaluate DeRetSyn using automated metrics and partial human validation across 14,500 synthesized question-answer pairs and find that it achieves 63% top-1 accuracy using a 3B Llama model, outperforming GPT-4o (42.5%) without RAG and an 8B Llama model with conventional RAG (≈53%) while being significantly smaller and more computationally efficient. We also find that the DeRetSyn system with the Llama 3B model outperforms GPT-4o on the publicly available PubMedQA dataset on overall accuracy under specific prompting patterns. The Surgical Information Assistant demonstrates how agentic orchestration can extend the capabilities of small language models and offers a deployable framework for point-of-care medical decision support, education, and QA in low-bandwidth environments. We plan to release our benchmark dataset, codebase, prompt library, and RAG evaluation results for all categories of the entire dataset, along with chain-of-thought reasoning from GPT-4o, Llama-3.1-8B, and Llama-3.2-3B, upon publication.
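A minimal sketch of Decompose-Retrieve-Synthesize control flow; the helper functions, the naive keyword matching, and the two-entry toy corpus are hypothetical placeholders, not the system's actual components:

```python
# Toy corpus standing in for the surgical manual (hypothetical entries).
corpus = {
    "hemorrhage control": "Apply direct pressure; escalate to packing.",
    "wound closure": "Close in layers with absorbable suture where available.",
}

def decompose(question):
    """Split a compound question into sub-questions (naive placeholder)."""
    return [q.strip() for q in question.split(" and ")]

def retrieve(sub_question):
    """Look up the best-matching corpus entry by keyword overlap."""
    words = set(sub_question.lower().split())
    return max(corpus, key=lambda k: len(words & set(k.split())))

def synthesize(evidence):
    """Combine the retrieved passages into one answer."""
    return " ".join(corpus[key] for key in evidence.values())

def deretsyn(question):
    subs = decompose(question)
    evidence = {q: retrieve(q) for q in subs}  # sub-question -> corpus entry
    return synthesize(evidence)

print(deretsyn("How to achieve hemorrhage control and wound closure?"))
```

In the real system each stage would be an LLM call (decomposition prompt, dense retrieval, synthesis prompt); the point of the sketch is only the orchestration pattern that lets a small model answer compound questions piece by piece.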

16
Human Evaluators vs. LLM-as-a-Judge: Toward Scalable, Real-Time Evaluation of GenAI in Global Health

Williams, G.; Rutunda, S.; Nzabakira, F.; Mateen, B.

2025-10-28 public and global health 10.1101/2025.10.27.25338910 medRxiv
Top 0.1%
63.2%

Evaluating the outputs of generative AI (GenAI) models in healthcare remains a significant bottleneck for the safe and scalable deployment of these tools. Human expert raters remain the gold standard for assessing the accuracy, contextual appropriateness, and empathy of AI-generated responses, but their assessments are costly, inconsistent, and difficult to scale. The concept of "LLM-as-a-judge" systems, i.e., AI models that can evaluate other AI outputs, has been recently proposed; however, their reliability in global health contexts remains untested. In this study, we systematically compared five LLM-judges and six expert human clinicians in evaluating both human- and AI-generated responses to real-world questions submitted by Rwandan community health workers seeking clinical decision support. Using an adapted version of the Med-PaLM 2 evaluation framework, evaluators scored responses across 11 criteria. Our results show that even the highest-performing LLM-judge (Claude-4.1-Opus) achieved human-equivalent evaluations on only four of eleven criteria. Constructing "LLM juries" to balance model-specific biases improved agreement on only one additional criterion. Some models were consistently overcritical (GPT-5) or overly lenient (Gemini-2.5-Pro). Moreover, performance and cost-effectiveness deteriorated substantially when moving from English to Kinyarwanda inputs. Overall, while LLM-judges demonstrate potential as scalable and internally consistent evaluators of GenAI outputs in healthcare, their sensitivity to linguistic and cultural context is a critical limitation. These findings underscore the need for further investment in scalable evaluation solutions, as well as potentially a fundamental rethink of how we approach the concept of "correctness" in clinical AI assessment (which is currently based on highly inconsistent expert clinician raters).

17
ChatGPT Influence on Medical Decision-Making, Bias, and Equity: A Randomized Study of Clinicians Evaluating Clinical Vignettes

Goh, E.; Bunning, B.; Khoong, E.; Gallo, R.; Milstein, A.; Centola, D.; Chen, J. H.

2023-11-27 health informatics 10.1101/2023.11.24.23298844 medRxiv
Top 0.1%
62.8%

In a randomized, pre-post intervention study, we evaluated the influence of a large language model (LLM) generative AI system on the accuracy of physician decision-making and on bias in healthcare. Fifty US-licensed physicians reviewed a video clinical vignette featuring actors representing different demographics (a White male or a Black female) with chest pain. Participants answered clinical questions about triage, risk, and treatment based on these vignettes, then were asked to reconsider after receiving advice generated by ChatGPT+ (GPT-4). The primary outcome was the accuracy of clinical decisions against pre-established evidence-based guidelines. Results showed that physicians were willing to change their initial clinical impressions given AI assistance, and that this led to a significant improvement in clinical decision-making accuracy in a chest pain evaluation scenario without introducing or exacerbating existing race or gender biases. A survey of physician participants indicated that the majority expect LLM tools to play a significant role in clinical decision-making.
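The pre-post design above compares each physician's answer before and after seeing LLM advice against a guideline-based reference. A minimal sketch of that accuracy comparison, with entirely made-up answers rather than the study's data:

```python
# Illustrative pre-post accuracy comparison: each tuple is one
# physician's answer before advice, after advice, and the
# guideline-based reference answer. Data are invented for the sketch.
cases = [
    ("admit", "admit", "admit"),
    ("discharge", "admit", "admit"),      # switched to the correct answer
    ("admit", "admit", "discharge"),      # stayed incorrect
    ("discharge", "discharge", "discharge"),
]

def accuracy(answers, reference):
    """Fraction of answers matching the guideline reference."""
    return sum(a == r for a, r in zip(answers, reference)) / len(reference)

ref = [c[2] for c in cases]
pre = accuracy([c[0] for c in cases], ref)    # accuracy before advice
post = accuracy([c[1] for c in cases], ref)   # accuracy after advice
```

In the toy data, accuracy rises from 0.5 to 0.75 after advice; the study's claim is that a shift of this kind occurred without widening gaps between the demographic vignette variants.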

18
The Neurosurgical Uncertainty Index: Self-Doubting AI for rare or unexpected surgical complications

Thiong'o, G. M.; Ogundokun, A.

2025-05-02 surgery 10.1101/2025.05.01.25326833 medRxiv
Top 0.1%
62.2%

Rare or unexpected postoperative neurosurgical complications pose a challenge due to clinical variability and gaps in available data. We introduce the Neurosurgical Uncertainty Index (NUI), an uncertainty-aware AI framework that integrates bootstrap sampling for aleatoric uncertainty, isolation-forest anomaly detection, and clinical calibration to predict and stratify risks for 13 complications. NUI distinguishes between data-driven and model-driven uncertainty and highlights cases that conventional models often miss. In a cohort of 80 patients, the hybrid Rare Event Score (anomaly × uncertainty) achieved critical risk stratification with an AUROC of 0.92 (95% CI 0.85-0.97) for complications requiring intervention, demonstrating 89% precision for critical cases (Score < 0.8). Entropy thresholds (> 1.5 nats) flagged 18% of predictions for review, preventing three overconfidence errors. Interpretable risk tiers are designed to integrate seamlessly with clinical workflows. By merging machine learning, neurosurgery, and epistemology, NUI promotes AI that acknowledges its limitations, with the aim of safer surgery.
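The ingredients named in the abstract — bootstrap-based uncertainty, isolation-forest anomaly detection, and a hybrid anomaly × uncertainty score gated by an entropy threshold — can be sketched generically. This is not the authors' implementation: the synthetic data, binary outcome (the paper covers 13 complications), and the 0.6-nat threshold (binary entropy tops out near 0.69 nats, whereas the paper's 1.5-nat threshold presumes a larger outcome space) are all assumptions of the sketch.

```python
# Sketch of an uncertainty-aware rare-event score: a bootstrap ensemble
# yields a predictive distribution whose entropy measures uncertainty,
# an isolation forest yields an anomaly score, and their product forms
# the hybrid score. All data and thresholds are illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Bootstrap ensemble: refit on resamples, average predicted probabilities
probs = []
for _ in range(20):
    idx = rng.integers(0, len(X), len(X))
    clf = LogisticRegression().fit(X[idx], y[idx])
    probs.append(clf.predict_proba(X)[:, 1])
p = np.mean(probs, axis=0)

# Predictive entropy in nats (binary case here)
eps = 1e-12
entropy = -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))

# Isolation-forest anomaly score, rescaled to [0, 1] (higher = more anomalous)
iso = IsolationForest(random_state=0).fit(X)
raw = -iso.score_samples(X)
anomaly = (raw - raw.min()) / (raw.max() - raw.min())

# Hybrid rare-event score: anomaly x uncertainty
rare_event_score = anomaly * entropy
flagged_for_review = entropy > 0.6  # illustrative stand-in for the nat threshold
```

The design point the abstract makes is that the two signals are complementary: a case can be unremarkable to the anomaly detector yet sit near the decision boundary (high entropy), or vice versa, and the product surfaces cases where both warnings fire.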

19
Patient digital twins: an introduction based on a scoping review

Drummond, D.; Gonsard, A.

2024-02-21 health informatics 10.1101/2024.02.20.24303096 medRxiv
Top 0.1%
61.7%

The concept of digital twins, widely adopted in industry, is entering healthcare. In this scoping review, we analysed definitions and characteristics of patient digital twins being developed for clinical use. Searching for studies claiming digital twin development/evaluation until August 2023, we identified 86 articles representing 80 unique claimed digital twins, nearly all (98%) in preclinical phases. From the analysis of definitions and characteristics, we propose to define patient digital twin as "a viewable digital replica of a patient, organ, or biological system that contains multidimensional, patient-specific information". Two main forms were found: simulation digital twins using computational modelling of patient anatomy/physiology to run personalised outcome predictions and therapy evaluations, mostly for one-time assessments; and monitoring digital twins harnessing aggregated patient data for continuous risk/outcome forecasting over time and care optimisation. As patient digital twins rapidly emerge, the proposed definitions and subtypes offer a framework to guide research into realising the potential of these personalised, integrative technologies to advance clinical care.

20
Artificial Intelligence in Medicine: Revolutionizing Healthcare Practices and Patient Outcomes

Budi Susilo, Y. K.; Abdul Rahman, S.; Yuliana, D.; Abdul Rasyid, F.

2025-03-23 public and global health 10.1101/2025.03.23.25324467 medRxiv
Top 0.1%
61.3%

Artificial intelligence (AI) has transformed medicine, advancing diagnostics, treatment, and patient outcomes. This study employs bibliometric analysis and the PRISMA framework to explore AI research trends in medicine from 2019 to 2023. Key findings reveal significant growth in publications, with radiomics, genomics, and predictive analytics as major focus areas. Federated learning frameworks and wearable technologies emerge as critical innovations, addressing challenges such as data privacy and real-time monitoring. Despite these advancements, significant barriers persist, including algorithmic bias, data confidentiality, and the need for infrastructure and training to integrate AI into clinical workflows. Future directions emphasize the importance of interdisciplinary collaboration, ethical AI development, and explainable models to ensure trust and equitable access. By addressing these challenges, AI has the potential to revolutionize healthcare systems, optimize resource allocation, and enhance personalized medicine. This study highlights the transformative role of AI in shaping the future of medicine and improving global health outcomes.